# Stable Random Projection files

NOTE: the raw binary files, which are the single most useful
supplement to this paper, are not included in this bundle for space
reasons. They can be downloaded from Northeastern's institutional
repository at PLACEHOLDERXXX.



## Libraries

Three libraries for working with these models are in `/libs`: JavaScript, Python, and R.

Each is on GitHub in its own right; the versions in `/libs` are for archival purposes only.

1. The [python library](https://github.com/bmschmidt/pySRP) allows training and a wide variety of access points. The README
   gives extensive examples of use.

2. The [Javascript library](https://github.com/bmschmidt/SRP-js) works as a Node module and can compute the transformations of
   text in-browser.

3. The [R library](https://github.com/bmschmidt/SRP-js) allows reading and transformation of files. Most of the file I/O 

The Python library is by far the most developed. The other two are
suitable for use in analysis pipelines, but I would strongly discourage
anyone from training vector sets for distribution in R or
JavaScript. A model trained on vectors from Python (for example) runs
effectively on hashes computed in JavaScript, but because regular
expression implementations differ between languages, there are likely
to be minor differences in the computed hashes, especially if texts
include non-ASCII characters.
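
As a rough illustration of how regular expression behavior can change tokenization on non-ASCII text, the sketch below compares Python's default Unicode-aware `\w` with its ASCII-only mode (JavaScript's `\w` is ASCII-only). The pattern `\w+` is an assumption chosen for the example, not necessarily the tokenizer any of the three libraries uses.

```python
import re

# Sample text with accented characters (written with escapes to avoid encoding ambiguity).
text = "caf\u00e9 na\u00efve"

# Python's default \w is Unicode-aware, so accented letters stay inside a token.
unicode_tokens = re.findall(r"\w+", text)

# With re.ASCII, \w matches only [a-zA-Z0-9_], so accented letters split tokens.
ascii_tokens = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_tokens)  # ['café', 'naïve']
print(ascii_tokens)    # ['caf', 'na', 've']
```

Different tokens mean different hashed words, which is why vectors computed in different languages may not match exactly.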


## Data Preparation

`preparation/` includes code to build SRP vector sets.

It also includes code to parse Hathi MARC records into readable formats. I do not have permission to redistribute the original Hathi MARC records in raw form.
Parsing of MARC records was done with the code in `Bookworm-MARC/Hathi catalog builder.ipynb`; metadata was extracted from the resultant json file using
the program `jq`. I don't give the full replication chain because the derived files are also present.

The main copy of the code for parsing MARC records is on github [here](https://github.com/Bookworm-project/Bookworm-MARC).

Note: the Hathi Extracted Features (EF) files are extremely large (multiple terabytes), and the code assumes they are stored on a separate disk, `/drobo`.

`preparation/` also includes code to build randomly shuffled test, train, and validation sets on disk from a single file.
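
A minimal sketch of a shuffled three-way split might look like the following. It is not the code in `preparation/`: the function name `split_lines`, the 80/10/10 fractions, and the one-record-per-line layout are all assumptions for illustration, and it shuffles in memory for simplicity, which would not suit the largest files.

```python
import random


def split_lines(in_path, out_paths, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the lines of in_path and write train/test/validation files.

    out_paths is a dict with keys "train", "test", and "validation".
    fractions gives the train and test shares; validation gets the rest.
    Returns the number of lines written to each set.
    """
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()

    # Make sure every record ends with a newline so shuffled records never merge.
    lines = [ln if ln.endswith("\n") else ln + "\n" for ln in lines]

    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(lines)

    n_train = int(len(lines) * fractions[0])
    n_test = int(len(lines) * fractions[1])
    parts = {
        "train": lines[:n_train],
        "test": lines[n_train:n_train + n_test],
        "validation": lines[n_train + n_test:],
    }

    for name, chunk in parts.items():
        with open(out_paths[name], "w", encoding="utf-8") as out:
            out.writelines(chunk)

    return {name: len(chunk) for name, chunk in parts.items()}
```

Fixing the seed means the same input file always yields the same three sets, which helps when comparing models trained at different times.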


## Analysis

`tensorflow/` includes code for working with shallow neural
networks. The primary code for training neural networks is in the
folder `tf_helpers`, and is dispatched to create the 20 or so neural
network models of various dimensions plotted in the post through the
notebook `Train LC classification models exploring parameter
space.ipynb`.

`tensorflow/data_to_classify_on` includes a number of `.csv.gz` files used as input for classification. These files have been gently cleaned.

The datasets, and even the checkpoints, can swamp a small hard
disk. The code therefore assumes two other locations:

* `~/vector_models` (to store binary SRP files).
* `~/checkpoints` (to which tensorflow models are written).

The classification code writes out full estimates for the test set in the following format.

```
,correct,guess,htid,prob_of_guess,prob_of_real,real,source
0,True,DS,mdp.39015053826734,0.2302258461713791,0.2302258461713791,DS,test
1,False,PR,mdp.39015064368460,0.25744369626045227,0.1621781289577484,PZ,test
2,False,PR,yale.39002006729793,0.060545604676008224,0.006721824407577515,NB,test
3,False,PS,mdp.39015061859545,0.2013528048992157,0.0015696428017690778,PF,test
4,False,D,mdp.39015027942005,0.08364219963550568,0.02086617425084114,F,test
5,False,PQ,njp.32101062529126,0.09990037232637405,0.004373961128294468,QK,test
6,False,DS,mdp.39015040758461,0.17269620299339294,0.037416260689496994,JX,test
7,False,PR,mdp.39015022038254,0.1406562328338623,0.04216545820236206,N,test
8,True,DS,uc1.b4303052,0.2040022611618042,0.2040022611618042,DS,test
```
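
A file in this format can be summarized with the standard library alone. The sketch below is illustrative only: the function name `summarize_predictions` is hypothetical, and it assumes, from the column names, that `correct` records whether `guess` equals `real` and that `prob_of_real` is the probability the model assigned to the true class.

```python
import csv
import io


def summarize_predictions(handle):
    """Return row count, accuracy, and mean probability given to the true class."""
    rows = list(csv.DictReader(handle))
    n = len(rows)
    # The "correct" column is stored as the strings "True" / "False".
    n_correct = sum(row["correct"] == "True" for row in rows)
    mean_prob_real = sum(float(row["prob_of_real"]) for row in rows) / n
    return {"n": n, "accuracy": n_correct / n, "mean_prob_of_real": mean_prob_real}


# The first two rows of the sample above, used as a tiny in-memory file.
sample = io.StringIO(
    ",correct,guess,htid,prob_of_guess,prob_of_real,real,source\n"
    "0,True,DS,mdp.39015053826734,0.2302258461713791,0.2302258461713791,DS,test\n"
    "1,False,PR,mdp.39015064368460,0.25744369626045227,0.1621781289577484,PZ,test\n"
)
summary = summarize_predictions(sample)
print(summary["accuracy"])  # 0.5
```

For a real output file, pass an open file handle instead of the `io.StringIO` sample.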

Some inference and analysis is in `./hathi_metadata/Inference and
explanation of LC Classification classifier.ipynb`, including several
examples of random assignments on both books and Wikipedia articles.

`EDA/` includes code to perform exploratory analysis on these results,
mostly in R. This generates most of the figures for the paper: most
are in `Exploring model failure by size.Rmd`.
